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We introduce a novel generalization of the discrete nonlinear Schrodinger equation. It supports 
solitons that describe how proteins fold. As an example we scrutinize the villin headpiece HP35, 
an archetypal protein for testing both experimental and theoretical approaches to protein folding. 
Using explicit soliton profiles we construct its carbon backbone with an unprecedented accuracy. 



The discrete nonlinear Schrodinger equation [I] is a prime example of a universal nonlinear equation. The equation 
originally appeared in connection of a study of polarons in molecular crystals [2J. It supports both stationary and 
time dependent soliton solutions that were first introduced to describe Davydov solitons in proteins [3j, then found 
applications to the crystalline state of acetanilide [3] , and subsequently emerged in the study of optical waveguides and 
Bose- Einstein condensates [S]. Today the discrete nonlinear Schrodinger equation together with its generalizations 
(GDNLS) form a very actively studied family of nonlinear equations that are widely employed to describe a multitude 
of phenomena in disparate physical, chemical and biological scenarios PQ-0. 

Here we introduce a novel generalization of the discrete nonlinear Schodinger equation that governs the organizing 
principle for protein folding [7] , arguably among the most important unresolved phenomena in modern science. Our 
version of the GDNLS equation stems from a discrete lattice model introduced in [5] to describe the statistical 
properties of folded chiral homopolymers. A recent Monte Carlo investigation [5] has suggested that this model 
might support soliton-like solutions, and furthermore that these solitons might accurately model the folded protein 
structures that are stored in the Protein Data Bank (PDB) [10]. The goal of the present article is to adapt and develop 
the powerful exact and numerical techniques of GDNLS equations to address and resolve the organizing principles 
that underlie protein folding, whereupon a folded protein becomes very accurately described by a set of heteroclinic 
standing wave solutions i.e. dark solitons of an appropriate GDNSL equation. 

Our GDNLS equation for protein folding originates from the following energy functional [5], [5], 
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We take K% to be periodic, Ki E [— tt,it] mod(27r). It is our primary variable and subject to both local and nearest- 
neighbor interactions. In applications to protein folding we identify Kj with the discrete signed Frenet curvature of 
the protein backbone, at the position of the i th C a carbon. The variable n € [— 7r,7r] mod(27r) is a periodic auxiliary 
variable and only subject to local interactions, it describes the discrete Frenet torsion at the site i of the protein 
backbone. Finally, (6, c, d, e, m, q) are global parameters, they are specific to a given secondary superstructure. 

Our GDNLS equation emerges as follows: We first eliminate the auxiliary variable by varying the energy functional 
with respect to Tj. This gives us an equation of motion to resolve for n in terms of «j, 

dE 2 2 

— = 2bn i Ti + 2eTj + d + qn~ = 



We then perform a variation of the energy functional with respect to Ki , and substitute r, [Ki] from ^ into the ensuing 
equation of motion to arrive at our GNLS equation 

K i+ i - 2Ki + Ki-i = U'[Ki]Ki = ^-y- Ki (i = l,...,N) (3) 
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(with kq — kn+i — 0.) This equation determines the stationary points of the following GDNLS Hamiltonian 
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where the potential has the following functional form 

U[k] = - 
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Here the second and the third term are familiar in the context of the nonlinear Schrodinger equation [T]-[6]. But the 
first term appears to be novel in the present context, it resembles the potential term for the relative coordinate in the 
two-body Calogero model [llj . 

If we properly choose the parameters in so that the potential 17 [re] has two separate local minima, we can easily 
extend the results in |12j to ensure the existence of a dark soliton solution that interpolates between these two minima. 
Such a qualitative form of 17 [re] typically follows if away from the vicinity of re = the potential becomes dominated 
by the second contribution to E in (JlJ. This is the familiar double- well potential term, with minima at re = ±m. It 
turns out that in applications to protein folding the parameters should indeed be chosen in this manner and a dark 
soliton is a configuration that interpolates from the ground state in the vicinity of k\ rs ±m to the ground state in 
the vicinity of kn sa =pm, as we traverse the backbone. When we compute «j from |3| and tj from ^ and integrate 
the ensuing discrete Frenet equation we obtain a iV-vertex polygonal chain such that a ground state with re ~ ±m 
and t given by Q is a helix, with the dark soliton describing a loop that connects two helices. 

We follow [12] to solve Q iteratively by locating a fixed point of 
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Here {re^ }ieN denotes the n th iteration of an initial configuration {re^}jgjv and e is some sufficiently small but 
otherwise arbitrary numerical constant. It is obvious that a fixed point of Q satisfies the GDNLS equation Q. 

In our simulations we start from an initial configuration {re^}j 6 jy chosen to have the same overall topology as 
the desired dark multi-soliton solution. We take rej ^ to have the profile of a piecewise constant step-function, the 
constant values approximate the true potential minimum. They correspond to the a-helices and /3-strands in the 
protein backbone. There is a step with a change of sign in k\ ^ at each lattice site i = N a where a backbone loop is 
centered. Notice that as it stands, the energy functional has the re <H> — re reflection symmetry that may not be 
exactly realized by the desired dark soliton profiles - the a helices are not ideal, and there are proteins where a loop 
connects an a-helix with a /3-sheet. Thus we explicitely break this symmetry using the parameter m, and for this we 
set 



m — > m a for iV a _i < i < N a 



along the chain. Typical values for m a are m a ~ ±it/2 for a-helix, and m a f=a ±1 for /3-strand. 

We have performed extensive numerical investigations of the dark soliton solutions to Q. We have found that for 
proper values of the parameters these solitons can be combined into multi-solitons that together with ([2| give a very 
high accuracy approximation of various folded protein structures that are stored in the Protein Data Bank |10) . with 
the a-helices and /3-strands as the ground states and interpolated by dark solitons that describe the protein loops. 

As an example we here scrutinize the dark two-soliton that models the chicken villin headpiece subdomain HP35 
(PDB code 1YRF), a naturally existing 35-residue protein that has three a-helices separated from each other by two 
loops. The structure of HP35 is very robust and since the protein is also a very fast folder, the folding time is around 
4/xs, together with the engineered version (2F4K in PDB) and the very similar HP36 (1VII in PDB), the HP35 has 
become the subject to very extensive studies both experimentally [32]- [H] and theoretically [T7]-[2U]. Indeed, HP35 
is now a paradigm platform for testing approaches to protein folding. 

According to [H], the root mean square distance (RMSD) between the NMR spectroscopy and the x-ray crystallog- 
raphy structures of HP35 is around 1.3 A for the C a carbons. The overall resolution of the presumably more accurate 
x-ray data is 1.07A [To] . 

The authors of |17j-|20j report on the construction of native and near-native folds using various methods and with 
both explicit and implicit water. For example the proposed native fold in |19j deviates in average around 1.63 A in C a 
RMSD from the x-ray data [T5] for the sites 2-34 (counting from the N-terminus). The article also describes a single 
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trajectory that reaches a value of 0.39 A in RMSD i.e. a distance that is about half the radius of a single carbon 
atom [sic]. The authors of [20] report very similar results, with a proposed native fold average C a RMSD around 1.54 
- 1.65 A for the sites 2-34. They also report on a single trajectory that reaches C a RMSD value 0.55 A. 

We shall now explain how the dark solitons of Q quite effortlessly enable us to construct a backbone with 0.74 A 
RMSD accuracy for the C a carbons, for the sites 3-33 (counting from the N-terminus); The reason we do not consider 
the entire chain is that in order to compute the local curvature from the three dimensional space coordinates we need 
to know the coordinates of three adjacent C a carbons, and for the computation of local torsion we need four. 

We convert the PDB data for the C a carbons to the local curvature and torsion. The result is shown in Figure 1. 
From the Ki profile we conclude that the C a backbone of 1YRF consists of two dark solitons. These correspond to 
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FIG. 1: (Left) : The bond angles Ki of 1YFR (red) for the sites 3-33 (45-78 in the PDB indexing convention) and their 
approximation by a soliton solution to equation (blue). (Right) : The torsion angles n of 1YRF (red) for the sites 3-33 
(45-78 in the PDB indexing convention) and their approximation by a soliton solution to equation ^ (blue). 

the two loops of 1YRF and are located around the sites 49-53 (PDB indexing) and 58-62 in Figure 1, respectively. 
These solitons interpolate between ground states that correspond to the three a-helices of 1YRF. The first helix is 
located between the sites 42-49, the second between the loops around sites 53-58 and the third occupies the remaining 
sites starting from 62 in Figure 1. While the two soliton profiles {Ki} are clearly identifiable, the profile of {t^} is 
substantially less regular and a priori one may expect that the strong irregularity in {r^} reflects the amino acid 
differences in the side chains. However, we find that this is not the case. The {r^} profile can be computed very 
accurately from ^ in terms of the soliton profile Ki, the apparent irregularity reflects solely the mod(27r) multivalued 
character of a periodic variable. 

To construct the soliton profile we introduce for each of the two would-be solitons the parameters (b, c, d, e, m, q): 
There is one set of parameters for the sites z=3-13 (counting from TV terminus) and another set of parameters for the 
remaining sites. We construct the ensuing soliton solution of ^ by iterating Q to a fixed point, starting from the 
initial profile which is a step-function located at the solitons. We compute the RMSD between the fixed point and 
1YRF. We then change the parameters randomly and compute the new soliton profile, always starting from the same 
piecewise constant initial profile for the k^ . We compare its RMSD to 1 YRF with that obtained for the first set of 
initial parameters using the standard Metropolis algorithm deviced to minimize RMSD. By repeating these steps in 
combination with simulated annealing we eventually produce our final soliton solution. The construction of a folded 
structure takes about 10 hours using a single processor in a MacPro desktop computer 

Figure 2 compares our minimal RMSD two-soliton configuration with the 1YRF backbone constructed from the 
x-ray data, for the sites i=3-33. The RMSD between the two configurations is 0.74 A, well below the overall resolution 
of the x-ray data (which is 1.07 A). Consequently our dark soliton pair describes the native 1YRF backbone for all 
practical purposes, and with an accuracy comparable to that of the radius 0.70 A of a carbon atom. In Table 1 we 
provide the parameter values for this configuration, together with the parameter values for the best individual solitons 
we have found for the two loops. It is visible from the data that for values of k away from k w the potential energy 
is indeed strongly dominated by the double well contribution i.e. second term in 0, as we have expected. 

We have found that folded proteins can be described by dark soliton solutions of a generalized discrete nonlinear 
Schrodinger equation. This equation involves only global parameters specific to a secondary superstructure, and 
the final protein configuration is determined by a single function. In the particular case of 1YRF where there are 
several high precision results to compare with, we have constructed a two-soliton configuration that describes the 
native backbone with an atomary level accuracy which is around one Angstrom less than the present consensus value 




FIG. 2: Comparison between 1YRF backbone (red) and a soliton solution of UV (blue). The RMSD distance is 0.74 A. 
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IJ12 


1 st set 


-0.000646646 


0.227432 


0.0141014 


0.00162415 


-0.0051673 


1.68028 


1.68844 


2 nd set 


-0.0001126726 


0.418995 


0.000670547 


0.00025209 


-0.000318858 


1.69553 


1.53529 


soliton- 1 


-0.000516175 


0.662187 


0.0081804 


0.00110988 


-0.00356352 


1.48643 


1.48167 


soliton-2 


-0.0000443408 


0.577717 


0.000294502 


0.0000936295 


-0.000132267 


1.53816 


1.54597 



TABLE I: The parameter values for the two-soliton solution that describes the entire 1YRF protein with accuracy 0.74 A, for 
its first (1 st ) loop (sites 2-13) and second (2 nd ) loop (sites 14-33). We also present the parameter values for a dark soliton 
(soliton-1) that describes the first loop with accuracy 0.76 A, and the corresponding values for a dark soliton (soliton-2) that 
describes the second loop with accuracy 0.58 A. 

obtained in molecular dynamics simulations. Among our future challenges is the enumeration and modeling of the 
different secondary superstructures in PDB and to develop a soliton basis for protein structure prediction. Indeed, 
we find it remarkable that in our construction we assume nothing about the details of the amino acid sequence, we 
only describe a homogeneous C a backbone. Thus it is very unlikely that the common point of view that folding 
is mainly driven by side-chain interactions can be the full explanation. Instead, our results suggest the presence of 
a strong contribution from backbone hydrogen bonding [21] . [22j . The detailed amino acid structure then breaks 
the translation invariance along an otherwise homogeneous chain, and amino acids in particular structural disruptor 
proline determine the location and the size of the loops a.k.a. dark solitons. In this manner the folding geometry is 
dictated by genome. 

Our research is supported by grant from the Swedish Research Council (VR). N.M. thanks M. Herrmann for 
communications. We all thank Martin Lundgren for discussions. 
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